The Pictionary dataset provided was designed for computational models to classify which of six words is drawn in a 28 x 28 pixel grayscale image, much like the popular drawing-and-guessing game Pictionary. There are 784 variables representing the 28 x 28 grid of the input drawing, which we attempt to classify using a model. There are 6,000 training observations and 1,200 test observations requiring prediction. Each observation represents one drawing grid and the word category it belongs to (every observation belongs to exactly one of the six words), as shown below.
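As a minimal sketch of this layout, the 784 variables of one observation can be rearranged into the 28 x 28 grid they represent (the variable names and random values below are illustrative only, not the actual data):

```python
import numpy as np

# Hypothetical flattened observation: 784 pixel intensities (0-255),
# standing in for one row of the 'sketches' data.
rng = np.random.default_rng(0)
flat_drawing = rng.integers(0, 256, size=784)

# Recover the 28 x 28 grid that the 784 variables represent.
grid = flat_drawing.reshape(28, 28)
print(grid.shape)  # (28, 28)
```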
In the initial phase, we tested Random Forest and SVM on the 'sketches' data, but the accuracy rates were only 80% and 82% respectively. Seeking higher accuracy, we turned to convolutional neural networks (CNNs), deeper neural networks that have consistently performed well on image recognition in the annual ImageNet competition. Table 1 shows the results of testing several architectures: LeNet-5 (Dasaprakash, 2019), VGG16 (Deshmukh, 2018), ResNet50 (Rosebrock, 2017), and four other custom architectures modified from these three.
In order for neural networks to operate, some data cleaning steps are required. Firstly, the 'sketches' data needs numeric inputs, so all six classification labels are converted to a number from 0 to 5 in alphabetical order. Next, neural networks are sensitive to the scale of feature values, so all values in variables 1 to 784 are divided by 255 (the maximum pixel value) to standardize them to the range 0 to 1. Since the labels are categorical, they must be converted to a binary matrix; this is done by one-hot encoding the numeric labels with the to_categorical() function from Keras. Lastly, convolutional neural networks require the observation data to be four-dimensional, so the 'sketches' variables were reshaped into a 4-dimensional array using the array_reshape() function.
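The four preprocessing steps above can be sketched with plain numpy; this is an illustrative stand-in (toy data, three labels instead of six) rather than the actual Keras pipeline, with np.eye() playing the role of to_categorical() and reshape() the role of array_reshape():

```python
import numpy as np

# Toy stand-in for the 'sketches' data: 4 observations x 784 pixels (0-255)
# and their word labels; the names and values here are illustrative only.
rng = np.random.default_rng(1)
X = rng.integers(0, 256, size=(4, 784)).astype("float32")
labels = np.array(["banana", "apple", "fish", "apple"])

# Step 1: map the word labels to integers in alphabetical order.
classes = np.unique(labels)          # np.unique returns sorted classes
y = np.searchsorted(classes, labels)

# Step 2: scale pixel values into [0, 1] by dividing by the maximum (255).
X = X / 255.0

# Step 3: one-hot encode the integer labels into a binary matrix
# (the numpy equivalent of Keras's to_categorical()).
y_onehot = np.eye(len(classes))[y]

# Step 4: reshape to 4 dimensions (observations, height, width, channels),
# as array_reshape() does for the CNN input.
X = X.reshape(-1, 28, 28, 1)

print(X.shape, y_onehot.shape)  # (4, 28, 28, 1) (4, 3)
```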
The highest validation and training accuracy came from our Custom 1 (Govoruha, 2019) architecture (figure 1), which has 7 layers and adopts the initial layering structure of VGG16 with the addition of batch normalization and dropout. The architecture of Custom 1 is built on the VGG16 structure; however, instead of going deeper with more 3x3 convolutional layers, an 8x8 kernel is used to increase the receptive field and incorporate more information (Ghosh, 2017). The CNN process is illustrated in figure 2.
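The receptive-field trade-off mentioned above can be made concrete with a small calculation. Assuming stride-1 convolutions and no pooling (a simplification of the real architecture), each k x k layer extends the receptive field by k - 1 pixels, so a single 8x8 kernel sees a larger input region than a typical stack of 3x3 layers:

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions:
    each k x k layer adds (k - 1) pixels to the field."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Three stacked 3x3 layers (the VGG16 approach) see a 7x7 region...
print(receptive_field([3, 3, 3]))  # 7
# ...while a single 8x8 kernel sees a slightly larger 8x8 region in one layer.
print(receptive_field([8]))        # 8
```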